Week 8 - Support Vector Machines¶

Dr. David Elliott¶

1.1. Introduction

1.2. Dataset Example

1.3. What is a Hyperplane?

1.4. Maximal Margin Classifier


1.1. Introduction ¶

The term Support Vector Machine (SVM) is sometimes used loosely to refer to three methods, each an extension of the previous one1:

  • Maximal margin classifier,
  • Support vector classifier,
  • Support vector machine.

SVMs are a common supervised discriminative algorithm, well suited to complex small- to medium-sized datasets2.

They can be used for both classification and regression.

1.2. Dataset Example ¶

To demonstrate SVMs I'll be using Fisher's (or Anderson's) Iris flowers dataset3.

The dataset consists of 50 samples each from three species of Iris (Iris setosa, Iris virginica and Iris versicolor).

  • "...This data comes from a famous experiment on a series of measurements of three species of iris flowers. R A Fisher, a statistically minded thinker in the early twentieth century used this dataset in his 1936 paper The Use of multiple measurements in taxonomic problems, published in the Annals of Eugenics." https://bookdown.org/koehlerj/qr_book/introducing-the-iris-dataset.html

  • "The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species." https://cloudxlab.com/assessment/displayslide/5165/keras-project-iris-flower-identification-introduction-to-iris-dataset

Figure 1: Iris Flowers

Five attributes were collected for the 150 records:

  • sepal length (cm)
  • sepal width (cm)
  • petal length (cm)
  • petal width (cm)
  • species
Figure 2: Iris Attributes

Notes

  • The middle thick line is a hyperplane and the dashed outer lines are the edges of the margin.
  • The circled points are the class examples that fall on the margin. These will later be termed support vectors.
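The dataset ships with scikit-learn, so (assuming `sklearn` is available) we can load it directly and confirm the record count and attribute names:

```python
from sklearn.datasets import load_iris

# Fisher's Iris dataset: 150 records, 4 numeric attributes per flower,
# plus the species label in `target`.
iris = load_iris()

print(iris.data.shape)      # (150, 4)
print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```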

1.3. What is a Hyperplane? ¶

In $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$.

Examples

  • Two dimensions: a flat one-dimensional line
  • Three dimensions: a flat two-dimensional plane
  • $p$ dimensions: a flat $(p-1)$-dimensional subspace

Notes

  • affine just means "does not need to pass through the origin"

  • Figure Examples

    • Two-dimensions: Figures 5 & 6
    • Three-dimension: Figure 7
    • P-dimensions: ...you'll have to use your imagination.
  • If you want to interact with the 3D plot use %matplotlib notebook


In more detail1:

Two-dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 = 0$

Three-dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 = 0$

$p$-dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p = 0$


If $X = (X_1, ..., X_p)^T$ satisfies above, then it is a point on the hyperplane.


If $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p > 0$, the point lies on one side of the hyperplane,

and if $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p < 0$, it lies on the other side.

Notes

  • In other words, $p$-dimensional space is divided into two halves.
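As a quick sanity check, here is a hypothetical 2D hyperplane (the coefficients are made up for illustration) showing how the sign of the expression splits the plane into two halves:

```python
import numpy as np

# Hypothetical coefficients: the hyperplane 1 + 2*X1 + 3*X2 = 0.
beta = np.array([1.0, 2.0, 3.0])

def f(x1, x2):
    """Evaluate beta_0 + beta_1*X1 + beta_2*X2 at a point."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

print(f(1.0, -1.0))       # 0.0 -> the point lies on the hyperplane
print(f(1.0, 1.0) > 0)    # True -> one side
print(f(-1.0, -1.0) < 0)  # True -> the other side
```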

Classifying data¶

We aim to classify an $n \times p$ matrix of $n$ observations in $p$ dimensional space, with these observations falling into two classes $y_1,...,y_n \in \{-1,1\}$.

If we were to perfectly separate the classes the hyperplane would have the property that

$\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... + \beta_pX_{ip} > 0 \quad \text{if} \ y_i = 1$,

$\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... + \beta_pX_{ip} < 0 \quad \text{if} \ y_i = -1$.


For a new test observation $x^*$, we would look at the sign of:

$$f(x^*) = \beta_0 + \beta_1x_1^* + \beta_2x_2^* + ... + \beta_px_p^*.$$

We would assign it to class 1 if $f(x^*)$ is positive and class -1 if negative.

Furthermore, we could use the magnitude of $f(x^*)$ to indicate how far the point lies from the hyperplane.
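A minimal sketch of this decision rule, using made-up coefficients $\beta_0 = -1$, $\beta = (1, 2)$ purely for illustration:

```python
import numpy as np

# Hypothetical fitted coefficients, for illustration only.
beta0 = -1.0
beta = np.array([1.0, 2.0])

def classify(x_star):
    f = beta0 + beta @ x_star       # f(x*) = beta_0 + sum_j beta_j * x*_j
    label = 1 if f > 0 else -1      # the sign of f(x*) gives the class
    return label, abs(f)            # |f(x*)| indicates distance from the plane

print(classify(np.array([4.0, 1.5])))    # (1, 6.0): far on the positive side
print(classify(np.array([0.25, 0.25])))  # (-1, 0.25): just on the negative side
```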

1.4. Maximal Margin Classifier ¶

We need a reasonable way of constructing a hyperplane out of the possible choices.

Maximal margin classifiers choose the separating hyperplane that is farthest from the training observations: we compute the perpendicular distance from each training observation to a given separating hyperplane, and the smallest such distance is the margin.

The maximal margin hyperplane is the separating hyperplane for which the margin is largest.

We hope the classifier with a large margin on the training data will generalise well to unseen test observations.
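As a sketch (assuming scikit-learn), fitting a linear SVM with a very large C approximates the maximal margin classifier on two linearly separable Iris classes; only a handful of training points end up defining the margin:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Setosa vs versicolor on the two petal features: linearly separable.
X, y = load_iris(return_X_y=True)
X, y = X[y < 2, 2:], y[y < 2]

# A very large C approximates a hard (maximal) margin.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

w = svm.coef_[0]
print("support vectors:", len(svm.support_vectors_))  # only a few points
print("margin width:", 2 / np.linalg.norm(w))
```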


Another way of defining our hyperplane is:

$$\mathbf{w}^T\mathbf{x} + b = 0.$$

Our decision rule, where our labels $y \in \{-1,1\}$, is then

$$\mathbf{w}^T\mathbf{x}_i + b \geq 0 \text{ for } y_i = +1,$$

$$\mathbf{w}^T\mathbf{x}_i + b < 0 \text{ for } y_i = -1.$$

Notes

  • $\mathbf{w}$ is a weight vector
  • $\mathbf{x}$ is an input vector
  • $b$ is our bias

We can see below there are two equidistant points from the maximal margin hyperplane, lying on the dashed lines. These observations are called Support Vectors.

We can write these two points, $x_1$ and $x_2$, as:

$$\mathbf{w}^Tx_1 + b = 1,$$

$$\mathbf{w}^Tx_2 + b = -1.$$

If we move our support vectors our hyperplane will too.

This is because the maximal margin hyperplane only depends on these support vectors.

Other data points could be moved without the hyperplane moving.

We want to maximise the distance between the margin lines, on which the points lie4.

Starting from the distance between $x_1$ and $x_2$, we subtract the two support-vector equations (the $b$ terms cancel):

$$\mathbf{w}^Tx_1 + b = 1$$

$$\mathbf{w}^Tx_2 + b = -1$$

$$\Rightarrow \quad \mathbf{w}^T(x_1 - x_2) = 2$$

Dividing both sides by the norm $||\mathbf{w}||$ gives

$$\frac{\mathbf{w}^T}{||\mathbf{w}||}(x_1 - x_2) = \frac{2}{||\mathbf{w}||}$$

Notes

  1. we want the distance between $x_1$ and $x_2$
  2. our $b$ cancels out and we end up with equation 3
  3. we want to remove $\mathbf{w}^T$
  4. we do this with norm of $\mathbf{w}^T$. However we cannot remove $||\mathbf{w}||$, so we are left to maximise $\frac{2}{||\mathbf{w}||}$.
  • $||\mathbf{w}||$ is the Euclidean norm of $\mathbf{w}$

We want to $\max \frac{2}{||\mathbf{w}||}$ while classifying everything correctly: $y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1 \quad \forall i$.

As our two margin lines are parallel and no training points fall between them, we can find the pair of hyperplanes which gives the maximum margin by minimizing $||\mathbf{w}||^2$. So now we have5:

$\min_{\mathbf{w}, b} \frac{1}{2}||\mathbf{w}||^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1 \quad \forall i$,

which is easier because this is a convex quadratic optimisation problem which is efficiently solvable using quadratic programming algorithms.

Notes

  • $\frac{1}{2}$ is added for convenience.
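To make the optimisation concrete, here is a sketch that solves this primal problem directly with `scipy.optimize.minimize` (SLSQP) on a tiny made-up dataset; the data and starting point are assumptions for illustration only:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as theta = [w1, w2, b].
def objective(t):
    return 0.5 * (t[0] ** 2 + t[1] ** 2)       # (1/2)||w||^2

# One inequality constraint per point: y_i (w . x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```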

Primal Problem¶

This requires a Lagrangian formulation of the problem so we introduce Lagrange multipliers, $\alpha_i, i = 1, \ldots , l$, where $l$ is the number of training points:

$$ \min L_P = \frac{1}{2}||\mathbf{w}||^2 - \sum^l_{i=1} \alpha_iy_i(\mathbf{x}_i\cdot\mathbf{w}+b) + \sum^l_{i=1} \alpha_i \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0 $$

Dual Problem6,7¶

Removes the dependence on $\mathbf{w}$ and $b$,

$$ \max L_D = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\mathbf{x}_i\cdot \mathbf{x}_j \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0. $$

To achieve this we first need to set a constraint that allows us to solve for $\alpha_i$,

$$\sum_i\alpha_iy_i = 0.$$

Knowing our $\alpha_i$ means we can find the weights, which are a linear combination of the training inputs, $\mathbf{x}_i$, training outputs, $y_i$, and the values of $\alpha$,

$$\mathbf{w} = \sum^{N_S}_{i=1}\alpha_iy_i\mathbf{x}_i,$$

where $N_S$ is the number of support vectors6.

$b$ is then implicitly determined: choose any $i$ for which $\alpha_i \neq 0$ and compute $b$ using the following Karush-Kuhn-Tucker (KKT) condition: $$ \alpha_i(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1) = 0 \quad \forall i $$

Notes

  • $\sum_i\alpha_iy_i = 0$ removes our $b$
  • There is a Lagrange multiplier $\alpha_i$ for every training point. Points for which $\alpha_i > 0$ are called "support vectors", and lie on one of the margins, with all other training points having $\alpha_i = 0$.
  • Notice this means we only need to focus on those with $\alpha_i > 0$.
  • All $\alpha_i$ will only be zero if all training points have the same class
  • Solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker (KKT) conditions.
  • It is numerically safer to take the mean value of $b$ resulting from all such equations.
  • For more reading on quadratic programming for support vector machines, I recommend you read:
    • Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
    • https://www.adeveloperdiary.com/data-science/machine-learning/support-vector-machines-for-beginners-duality-problem/
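scikit-learn's SVC exposes exactly these quantities: `dual_coef_` holds $\alpha_i y_i$ for the support vectors only, so we can reconstruct $\mathbf{w}$ and $b$ ourselves (a sketch, assuming the separable setosa-vs-versicolor split used earlier):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2, 2:], y[y < 2]
y_pm = np.where(y == 0, -1, 1)               # relabel as {-1, +1}

svm = SVC(kernel="linear", C=1e6).fit(X, y_pm)

# dual_coef_[0] is alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i is a single matrix product.
w = svm.dual_coef_[0] @ svm.support_vectors_
print(np.allclose(w, svm.coef_[0]))          # True

# b from the KKT condition y_i (w . x_i + b) = 1, averaged over the
# support vectors for numerical safety.
y_sv = y_pm[svm.support_]
b = np.mean(y_sv - svm.support_vectors_ @ w)
print(b, svm.intercept_[0])                  # near-identical values
```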

Inner products, similarity, and SVMs8¶

Question: But why bother doing this? That was a lot of effort, why not just solve the original problem?

Answer: Because this lets us solve the problem by computing just the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$, which will be important when we want to solve non-linearly separable classification problems.

  • Inner products provide a measure of "similarity"
  • Inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them = how "far apart" they are.
    • if they are parallel their inner product is 1 (completely similar), $\mathbf{x}^T\mathbf{y} = \mathbf{x} \cdot \mathbf{y} = 1$.
    • If they are perpendicular (completely unlike) their inner product is 0 (so should not contribute to the correct classifier), $\mathbf{x}^T\mathbf{y} = \mathbf{x} \cdot \mathbf{y} = 0$.
$$ L_D(\alpha_i) = \sum_i^l\alpha_i - \frac{1}{2}\sum_{i,j}^l\alpha_i\alpha_jy_iy_j\mathbf{x}_i\cdot \mathbf{x}_j \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0, \ \sum_i^l\alpha_iy_i = 0. $$

Case 1: Two features $x_i$, $x_j$ are completely dissimilar (orthogonal), so their dot product is 0. They don't contribute to $L_D$.

Case 2: Two features $x_i$, $x_j$ are similar and predict the same output value (i.e. both $+1$ or both $-1$). This means $y_iy_j = 1$ and the value $\alpha_i\alpha_jy_iy_j\mathbf{x}_i \cdot \mathbf{x}_j$ is positive. However, this decreases the value of $L_D$, due to subtraction from the first sum, $\sum_i^l\alpha_i$, so the algorithm downgrades similar feature vectors that make the same prediction.

Case 3: Two features $x_i$, $x_j$ make opposite predictions about the output value (i.e. one $+1$ and the other $-1$), but are otherwise similar. The value $\alpha_i\alpha_jy_iy_j\mathbf{x}_i \cdot \mathbf{x}_j$ is negative, and since we are subtracting it, it adds to the sum. These are the examples we are looking for, as they are critical for telling the two classes apart.
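A quick numeric illustration of inner products between unit-length vectors:

```python
import numpy as np

# Three unit vectors: a and b are perpendicular; c sits at 45 degrees to a.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0]) / np.sqrt(2.0)

print(a @ a)  # 1.0 -> parallel, completely similar
print(a @ b)  # 0.0 -> perpendicular, completely dissimilar
print(a @ c)  # ~0.707 -> cos(45 degrees)
```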


Test Phase¶

"Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between H1 and H2 and parallel to them) a given test pattern $\mathbf{x}$ lies and assign the corresponding class label, i.e. we take the class of x to be $\text{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$." Burges (1998)

Notes

  • Remember most multipliers $\alpha_i$ will be 0, as only the support vectors (on the margins, or gutters) have non-zero $\alpha_i$.
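A sketch of this test phase (assuming the same separable Iris split as before): computing $\text{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$ by hand reproduces the fitted model's predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2, 2:], y[y < 2]
y_pm = np.where(y == 0, -1, 1)            # labels in {-1, +1}

svm = SVC(kernel="linear", C=1e6).fit(X, y_pm)
w, b = svm.coef_[0], svm.intercept_[0]

# Test phase: assign the class sgn(w . x + b), as in Burges (1998).
manual = np.sign(X @ w + b)
print(np.array_equal(manual, svm.predict(X)))  # True
```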

Limitations¶

Maximal Margin Classifiers are sensitive to outliers. In figure X, we can see that the reliance on a small number of observations means there is now a small margin. We want to be confident that the distance from the hyperplane is a measure of our confidence in a classification, and that we have not overfit to our training data.

In other cases, no exact linear separating hyperplane exists. Therefore we may want to use a hyperplane that almost separates the two classes, allowing some errors, using a soft margin (Support Vector Classifier).

Furthermore, if we have a large number of features, this approach often leads to overfitting.

Associated Exercises¶

Now might be a good time to try exercises X-X.

References¶

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: springer.
  2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
  3. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
  4. [youtube vid]
  5. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  6. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
  7. Fletcher, R. (1987). Practical methods of optimization. John Wiley and Sons, 2nd edition.
  8. http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf